On Efficiency of Dataset Filtering Implementations in Constraint-Based Discovery of Frequent Itemsets
Authors
Abstract
Discovery of frequent itemsets is one of the fundamental data mining problems. Typically, the goal is to discover all the itemsets whose support in the source dataset exceeds a user-specified threshold. However, very often users want to restrict the set of frequent itemsets to be discovered by adding extra constraints on size and contents of the itemsets. Many constraint-based frequent itemset discovery techniques have been proposed recently. One of the techniques, called dataset filtering, is based on the observation that for some classes of constraints, itemsets satisfying them can only be supported by transactions that satisfy the same constraints. Conceptually, dataset filtering transforms a given data mining task into an equivalent one operating on a smaller dataset. In this paper we discuss possible implementations of dataset filtering, evaluating their strengths and weaknesses.
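The filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the brute-force counter, the function names (`support_counts`, `mine`, `mine_with_filtering`), the toy transactions, and the "itemset must contain bread" constraint are all assumptions chosen for illustration. The sketch also assumes the support threshold is an absolute transaction count, so it keeps its meaning after the dataset shrinks.

```python
from itertools import combinations


def support_counts(transactions, max_size):
    """Count the support of every itemset up to max_size by brute force.

    Illustration only; a real miner would use Apriori, FP-growth, etc.
    """
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def mine(transactions, min_count, constraint, max_size=3):
    """Baseline: count on the full dataset, then apply the constraint."""
    counts = support_counts(transactions, max_size)
    return {s: c for s, c in counts.items() if c >= min_count and constraint(s)}


def mine_with_filtering(transactions, min_count, constraint, max_size=3):
    """Dataset filtering: drop transactions that violate the constraint first.

    Assumes that whenever an itemset satisfies the constraint, every
    transaction containing it does too (true for "must contain item X").
    min_count is an absolute count, so it is unchanged by the filtering.
    """
    filtered = [t for t in transactions if constraint(frozenset(t))]
    return mine(filtered, min_count, constraint, max_size)


# Hypothetical toy data and constraint.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"butter", "eggs"},
    {"bread", "butter"},
]


def has_bread(itemset):
    return "bread" in itemset


baseline = mine(transactions, min_count=2, constraint=has_bread)
filtered_result = mine_with_filtering(transactions, min_count=2, constraint=has_bread)
```

Both calls return the same itemsets with the same supports, but the filtered variant counts over three transactions instead of five. If the threshold were given as a fraction, it would have to be converted to a count against the original dataset size before filtering.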
Similar resources
Dataset Filtering Techniques in Constraint-Based Frequent Pattern Mining
Many data mining techniques consist in discovering patterns frequently occurring in the source dataset. Typically, the goal is to discover all the patterns whose frequency in the dataset exceeds a user-specified threshold. However, very often users want to restrict the set of patterns to be discovered by adding extra constraints on the structure of patterns. Data mining systems should be able to...
MINING FUZZY TEMPORAL ITEMSETS WITHIN VARIOUS TIME INTERVALS IN QUANTITATIVE DATASETS
This research aims at proposing a new method for discovering frequent temporal itemsets in continuous subsets of a dataset with quantitative transactions. It is important to note that although these temporal itemsets may have relatively high support or occurrence within particular time intervals, they do not necessarily get similar support across the whole dataset, which makes i...
Data sanitization in association rule mining based on impact factor
Data sanitization is a process that is used to promote the sharing of transactional databases among organizations and businesses; it alleviates concerns for individuals and organizations regarding the disclosure of sensitive patterns. It transforms the source database into a released database so that counterparts cannot discover the sensitive patterns and so data confidentiality is preserved ag...
Parameterless Data Compression and Noise Filtering Using Association Rule Mining
The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based,...
DisClose: discovering colossal closed itemsets from high dimensional datasets via a compact row-tree
Data mining is an essential part of knowledge discovery, and performs the extraction of useful information from a collection of data, so as to assist human beings in making necessary decisions. This thesis describes research in the field of itemset mining, which performs the extraction of a set of items that occur together in a dataset, based on a user specified threshold. Recent focus of items...